Imagine baseball-sized ice balls falling from the sky.
For some Americans, this is a real weather phenomenon that risks the physical well-being of people and property. But how often does it happen? Where do these events normally happen?
As it turns out, the National Oceanographic and Atmospheric Administration (NOAA) collects massive amounts of data on precipitation using radar stations across the country. Algorithms can be built on top of the data to track ‘hail signatures’ detecting when these events occur and the severity. This severe storm data is captured in NOAA's Severe Weather Data Inventory (SWDI) housed within the National Centers for Environmental Information (NCEI).
Looking at over 9 million events over the last 10 years, we can see that most of these events happen in the Midwest.
A closer look at severe events (e.g. hail diameter > 3in), the majority of these events occur in a small snaking section of the country.
Getting Started
In this code notebook, we will:
To get started quickly, the code for this tutorial can be found at the following Github repo (here).
Leading off
Start off by specifying the working directory as well as calling 6 libraries:
library(sqldf)
library(RColorBrewer)
library(leaflet)
library(ggplot2)
library(DT)
library(rgdal)
library(sp)
Loading Data
SWDI API
The SWDI data covers a number of data series that use event detection algorithms on NEXRAD radar data. The main data series are as follows:
We’ll use the SWDI API to extract all hail signatures between 2005 and 2015 in 30 day incremenets as the maximum number of days per API call is 31 days or 744 hours. The wrapper ‘swdi_pull’ will need three parameters:
swdi_pull <- function(start_date,end_date,series){
##translate the string into a date, and range
start <- as.Date(start_date,"%Y-%m-%d")
range <- as.numeric(as.Date(end_date,"%Y-%m-%d")-as.Date(start_date,"%Y-%m-%d"))
##Placeholder for the result
raw <- data.frame()
##Loop through Day 0 through the full range of days
for(i in seq(0,range,30)){
##Load in parameters, hit API
print(i)
period <- start + i
increment0 <- paste(format(period,"%Y"),format(period,"%m"),format(period,"%d"),sep="")
increment1 <- paste(format(period+30,"%Y"),format(period+30,"%m"),format(period+30,"%d"),sep="")
temp <- read.csv(paste("http://www.ncdc.noaa.gov/swdiws/csv/",series,"/",increment0,":",increment1,sep=""))
##If the API kicks back a result
if(ncol(temp)!=1 && colnames(temp)[1]!="summary"){
raw <- rbind(raw,temp)
raw <- raw[!is.na(raw$LAT),]
}
}
##Clean up time steps -- remove data outside of specified period
raw$DATE<-as.Date(substr(raw$ZTIME,1,10),"%Y-%m-%d")
raw<-raw[raw$DATE<=as.Date(end_date,"%Y-%m-%d"),]
raw$HOUR<-substr(raw$ZTIME,12,13)
raw<-raw[,c("ZTIME","DATE","HOUR","WSR_ID","CELL_ID","PROB","SEVPROB","MAXSIZE","LAT","LON")]
return(raw)
}
Now, we can tap the API for data. We’ll first specify the parameters.
##Set Parameters
start_date = "2005-01-01"
end_date = "2014-12-31"
range <- as.Date(end_date,"%Y-%m-%d")-as.Date(start_date,"%Y-%m-%d")
series = "nx3hail"
fraction = 0.25
And using those parameters, we can now draw down the data. This will take a while – about 10 million records.
raw <- swdi_pull(start_date,end_date,series)
Geographic Files
As the SWDI data is point-level data that will be processed into equal-interval grid points, we will want to add spatial context to the data by spatially joining points to county boundary files. The US Census Bureau provides boundary shapefiles through their website (http://www2.census.gov/geo/tiger/GENZ2014/shp/cb_2014_us_county_20m.zip). To efficiently load it in, we’ll write a simple function to download, unzip and load a shapefile.
shape_direct <- function(url, shp) {
temp = tempfile()
download.file(url, temp) ##download the URL taret to the temp file
unzip(temp,exdir=getwd()) ##unzip that file
return(readOGR(shp,shp))
}
To run the shape_direct function, we just need the url and the shapefile name.
shp <- shape_direct(url="http://www2.census.gov/geo/tiger/GENZ2014/shp/cb_2014_us_county_20m.zip",
shp= "cb_2014_us_county_20m")
## OGR data source with driver: ESRI Shapefile
## Source: "cb_2014_us_county_20m", layer: "cb_2014_us_county_20m"
## with 3220 features
## It has 9 fields
In addition, we’re going to pull in a reference table that links Federal Information Processing System (FIPS) codes that contain numeric identifiers for states. This’ll be useful for clearly indicating in plain language which counties are in a given state.
fips <- read.delim("http://www2.census.gov/geo/docs/reference/state.txt",sep="|")
fips <- fips[,c("STATE","STUSAB")] ##Keep the FIPS code and state abbreviation
fips$STATE <- as.numeric(fips$STATE) ##Convert FIPS code to numeric
fips$STUSAB <- as.character(fips$STUSAB) ##Convert state name to character
Basic Processing
At this point, we have all the data (hail signatures, county shapefiles, FIPs codes).
paste("Number of records in Hail data: ",nrow(raw),sep="")
## [1] "Number of records in Hail data: 9285845"
paste("Number of counties in shapefile: ",nrow(as.data.frame(shp)))
## [1] "Number of counties in shapefile: 3220"
Now, we can begin the process down the data. The first issue is to convert hail signatures from events to regularly spaced grid points at the daily level. As two or more radar stations may detect the same hail event at the same time, this basic gridding process will help to reduce double counting as well as allow for calculating a climatology.
To start, we’ll set a bounding box of the continental US in order to focus on non-island effects.
#Cut down bounding box
raw <- raw[raw$LON<(-50) & raw$LON>(-140) & raw$LAT > 25,]
Also, the hail event coordinates will be rounded to the nearest fraction as specified in the starting parameters. In this case, the fraction is 1/4 of a degree or about 17.25 miles latitudinally. Using SQL, we will group hail events by date, lat, lon, and hail size. Then based on the hail size, we’ll produce dummy variables are produced for each 2+ inch and 3+ inch thresholds. For context, that a baseball is approximately 2.9 inches.
##Round coordinates
raw$LON <- round(raw$LON/fraction)*fraction
raw$LAT <- round(raw$LAT/fraction)*fraction
##De-duplicate by day, latitude and longitude
deduped_day <- sqldf("SELECT DATE, LON, LAT, MAXSIZE
FROM raw
GROUP BY DATE, LON, LAT, MAXSIZE")
## Loading required package: tcltk
##Dummy variable (and 3+ in)
deduped_day$lvl_3in<-0
deduped_day$lvl_3in[deduped_day$MAXSIZE>3]<-1
Based on the de-duplicated daily, gridded data, we’ll now group by once again by lat and lon coordinates. This time, we’ll count the number of records per gridpoint (cnt = any hail event) as well as sum the dummy variables for 3+ inch (cnt_3) events. These count variables are then normalized by the number of days specified in the API call (range).
##Daily gridded frequencies
singles <- sqldf("SELECT LON, LAT,COUNT(date) cnt, SUM(lvl_3in) cnt_3
FROM deduped_day
GROUP BY LON, LAT")
##Normalize
for(i in 3:ncol(singles)){
singles[,i]<-singles[,i]/as.numeric(range)
}
Get Mapping!
Processing for the Tooltip
Knowing where risks occur is not sufficient. A good risk map contains context. To do this for grid points, we’ll spatially join the grid points to the closest county, that is see which county a grid point falls. It’s not a perfect match, but gives some basic context for the local area.
To start, we need to convert the gridded data into a shapefile and define the projection as WGS84 – one of the most common spatial projections for global data.
##Set up spatial
points<-singles
coordinates(points)=~LON+LAT
proj4string(points)=CRS("+proj=longlat +ellps=WGS84 +datum=WGS84")
shp <- spTransform(shp, CRS("+proj=longlat +ellps=WGS84 +datum=WGS84") )
##Spatial join -- really simple!
a<-over(points,shp)
##Join the results back to singles
singles <- cbind(singles,a[,c("STATEFP","NAME")])
singles$NAME <- as.character(singles$NAME)
#Merge state abbreviation via FIPS
singles$STATEFP <-as.numeric(singles$STATEFP)
singles <- merge(singles,fips,by.x="STATEFP",by.y="STATE",all.x=T, sort=F)
singles$loc <- paste(singles$NAME,", ",singles$STUSAB,sep="")
singles$loc[is.na(singles$NAME)]<-""
singles <- singles[!is.na(singles$STUSAB),]
For the popup, we’ll need to create a separate subset for 3+ inch events, vectorize the fields, and combine fields for the popup message using paste0. The message needs to be written in HTML as this will be directly rendered in browser.
#Subset data into separate frames (needed for layers in leaflet.js)
in3 <- singles[singles$cnt_3>0,]
#Vectorize data for inclusion for popup text
#All points
county_all <- singles$loc
x_all <- singles$LON
y_all <- singles$LAT
pr <- round(100*singles$cnt,3)
#Hail balls > 3in
county_3 <- in3$loc
x3 <- in3$LON
y3 <- in3$LAT
pr3 <- round(100*in3$cnt_3,3)
#Popup
content_all <- paste("<h3>",county_all,"</h3>Grid Point: ",x_all,", ",y_all,"<p>Prob of Any Hail <span style='color:red'><strong>: ",pr,"%</strong></span></p>")
content_3 <- paste0("<h3>",county_3,"</h3>Grid Point: ",x3,", ",y3,"<p>Prob of Hail (> 3in)<span style='color:red'><strong>: ",pr,"%</strong></span></p>")
Color Palettes
For each series, we will need two color schemes that are sscaled to the range and variability in a data series. The color palette “Set1” used is pulled from the RColorBrewer package. The color scheme is high contrast and divergent, perfect for maps with varying levels of activity.
pal <- colorNumeric(
palette = "Set1",
domain = pr
)
pal2 <- colorNumeric(
palette = "Set1",
domain = pr3
)
Final Maps
To start, we’ll initialize a leaflet.js map, setting the view centered on New York City at zoom 10 (higher numbers = closer in), attributing the layer to CartoDB. With this basic code, we can build out maps nationally of all hail events and 3+ inch events.
leaflet(width="100%") %>%
setView(lat = mean(y_all), lng = mean(x_all),4) %>%
addTiles('http://{s}.basemaps.cartocdn.com/dark_all/{z}/{x}/{y}.png',
attribution = "NOAA NCEI SWDI, US Census Bureau TIGER, Cartodb basemap") %>%
addCircleMarkers(data = singles, lat = ~ LAT, lng = ~ LON,radius=(pr/5),
fillOpacity = 0.8,stroke = FALSE,
color = ~pal(pr), popup = content_all) %>%
addLegend("bottomright", pal = pal, values = pr,
title = "Prob(Hail)",labFormat = labelFormat(suffix = "%")
)
leaflet(width="100%") %>%
setView(lat = mean(y_all), lng = mean(x_all),4) %>%
addTiles('http://{s}.basemaps.cartocdn.com/dark_all/{z}/{x}/{y}.png',
attribution = "NOAA NCEI SWDI, US Census Bureau TIGER, Cartodb basemap") %>%
addCircleMarkers(data = in3, lat = ~ LAT, lng = ~ LON, radius=2*(pr3),
fillOpacity = 0.8,stroke = FALSE,
color = ~pal2(pr3), popup = content_all) %>%
addLegend("bottomright", pal = pal2, values = pr3,
title = "Prob(Hail diameter > 3in)",labFormat = labelFormat(suffix = "%")
)